Performance Results on the Intel Touchstone Gamma Prototype
Abstract
This paper describes the Intel Touchstone Gamma Prototype, a distributed memory MIMD parallel computer based on the new Intel i860 floating point processor. With 128 nodes, this system has a theoretical peak performance of over seven GFLOPS. This paper presents some initial performance results on this system, including results for individual node computation, message passing, and complete applications using multiple nodes. The highest rate achieved on a multiprocessor Fortran application program is 844 MFLOPS.

Overview of the Touchstone Gamma System

In spring of 1989, DARPA and Intel Scientific Computers announced the Touchstone project. This project calls for the development of a series of prototype machines by Intel Scientific Computers, based on hardware and software technologies being developed by Intel in collaboration with research teams at CalTech, MIT, UC Berkeley, Princeton, and the University of Illinois. The eventual goal of this project is the Sigma prototype, a 150 GFLOPS peak parallel supercomputer with 2000 processing nodes. One of the milestones towards the Sigma prototype is the Gamma prototype. At the end of December 1989, the Numerical Aerodynamic Simulation (NAS) Systems Division at NASA Ames Research Center took delivery of one of the first two Touchstone Gamma systems, and it became available for testing in January 1990.

The Touchstone Gamma system is based on the new 64-bit i860 microprocessor by Intel [4]. The i860 has over 1 million transistors and runs at 40 MHz (the initial Touchstone Gamma systems were delivered with 33 MHz processors, but these have since been upgraded to 40 MHz). The theoretical peak speed is 80 MFLOPS for 32-bit floating point operations and 60 MFLOPS for 64-bit floating point operations. The i860 features 32 integer address registers of 32 bits each, and 16 floating point registers of 64 bits each (or 32 floating point registers of 32 bits each). It also features an 8 kilobyte on-chip data cache and a 4 kilobyte instruction cache. There is a 128-bit data path between cache and registers, and a 64-bit data path between main memory and registers.

The i860 has a number of advanced features to facilitate high execution rates. First of all, a number of important operations, including floating point add, multiply, and fetch from main memory, are pipelined. This means that they are segmented into three stages, and in most cases a new operation can be initiated every 25 nanosecond clock period. Another advanced feature is that multiple instructions can be executed in a single clock period. For example, a memory fetch, a floating add, and a floating multiply can all be initiated in a single clock period.

A single node of the Touchstone Gamma system consists of the i860, 8 megabytes (MB) of dynamic random access memory, and hardware for communication to other nodes. The Touchstone Gamma system at NASA Ames consists of 128 computational nodes. The theoretical peak performance of this system is thus approximately 7.5 GFLOPS on 64-bit data.

The 128 nodes are arranged in a seven-dimensional hypercube using the direct connect routing module and the hypercube interconnect technology of the iPSC/2. The point-to-point aggregate bandwidth of the interconnect system, which is 2.8 MB/sec per channel, is the same as on the iPSC/2. However, the latency for message passing is reduced from about 350 microseconds to about 90 microseconds. This reduction is mainly obtained through the increased speed of the i860 on the Touchstone Gamma machine, when compared to
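The hardware figures quoted in the overview can be cross-checked with a few lines of arithmetic. The Python sketch below is an illustration only, not code from the paper. It rests on assumptions that the text does not state: nodes are labeled with 7-bit integers so that hypercube neighbors differ in exactly one bit, message time follows a simple linear model (startup latency plus bytes divided by channel bandwidth), and 1 MB is taken as 10^6 bytes. All function names are hypothetical.

```python
# Back-of-the-envelope model of the Touchstone Gamma figures quoted above.
# All names are hypothetical; the constants are taken from the text.

CLOCK_HZ = 40e6          # i860 clock rate (40 MHz)
PEAK_MFLOPS_64 = 60.0    # per-node peak, 64-bit floating point
NODES = 128              # nodes in the NASA Ames system
LATENCY_S = 90e-6        # message startup latency (about 90 microseconds)
BANDWIDTH_BPS = 2.8e6    # per-channel bandwidth; assumes 1 MB = 10**6 bytes


def clock_period_ns(clock_hz=CLOCK_HZ):
    """Length of one clock period in nanoseconds."""
    return 1e9 / clock_hz


def system_peak_mflops(nodes=NODES, per_node_mflops=PEAK_MFLOPS_64):
    """Aggregate theoretical peak: node count times per-node peak."""
    return nodes * per_node_mflops


def hypercube_dimension(nodes=NODES):
    """Dimension d such that 2**d == nodes (nodes must be a power of two)."""
    dim = nodes.bit_length() - 1
    if nodes < 1 or (1 << dim) != nodes:
        raise ValueError("node count must be a power of two")
    return dim


def neighbors(node, dim):
    """Nodes whose label differs from `node` in exactly one bit.

    Assumes the usual binary labeling of hypercube nodes.
    """
    return [node ^ (1 << d) for d in range(dim)]


def message_time_s(nbytes, latency_s=LATENCY_S, bandwidth_bps=BANDWIDTH_BPS):
    """Assumed linear cost model: startup latency plus transfer time."""
    return latency_s + nbytes / bandwidth_bps


if __name__ == "__main__":
    dim = hypercube_dimension()
    print("clock period (ns):", clock_period_ns())
    print("system peak (MFLOPS):", system_peak_mflops())
    print("hypercube dimension:", dim)
    print("neighbors of node 0:", neighbors(0, dim))
    print("time for a 1000-byte message (s):", message_time_s(1000))
```

Under these assumptions the aggregate peak comes to 7680 MFLOPS, which matches the quoted figure of approximately 7.5 GFLOPS if one GFLOPS is counted as 1024 MFLOPS. In the linear model, the message length at which transfer time equals startup latency is latency times bandwidth, or about 252 bytes at 90 microseconds, so short messages are dominated by latency rather than bandwidth.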
Similar resources
One year with an iPSC/860
This paper describes experiences over the past year with the Intel iPSC/860, a distributed memory MIMD parallel computer based on the Intel i860 floating point processor. The system at NASA Ames Research Center has 128 nodes, and a theoretical peak performance of over seven GFLOPS. This paper describes the system at Ames Research Center, talks about system stability, compiler performance meas...
Matrix Multiplication on the Intel Touchstone Delta
Matrix multiplication is a key primitive in block matrix algorithms such as those found in LAPACK. We present results from our study of matrix multiplication algorithms on the Intel Touchstone Delta, a distributed memory message-passing architecture with a two-dimensional mesh topology. We obtain an implementation that uses communications primitives highly suited to the Delta and exploits the s...
Communication on the Paragon
In this note we describe the results of some tests of the message-passing performance of the Intel Paragon. These tests have been carried out under both the Intel-supplied OSF/1 operating system with an NX library, and also under an operating system called SUNMOS (Sandia UNM Operating System). For comparison with the previous generation of Intel machines, we have also included the results on th...
Complete Exchange on a Wormhole Routed Mesh
The complete exchange (or all-to-all personalized) communication pattern occurs frequently in many important parallel computing applications. We discuss several algorithms to perform complete exchange on a two-dimensional mesh connected computer with wormhole routing. We propose algorithms for both power-of-two and non power-of-two meshes as well as an algorithm which works for any arbitrary m...
All-to-All Communication on Meshes with Wormhole Routing
This paper describes several algorithms to perform all-to-all communication on a two-dimensional mesh connected computer with wormhole routing. We discuss both direct algorithms, in which data is sent directly from source to destination processor, and indirect algorithms in which data is sent through one or more intermediate processors. We propose algorithms for both power-of-two and non power-...